Abstract

We show that the inequality $H(A|B,X) + H(A|B,Y) \le H(A|B)$ for jointly distributed random variables
$A, B, X, Y$, which does not hold in general, holds under a natural condition on the support of the probability
distribution of $A, B, X, Y$. This result generalizes a version of the conditional Ingleton inequality: if for some
distribution $I(X:Y|A) = H(A|X,Y) = 0$, then $I(A:B) \le I(A:B|X) + I(A:B|Y) + I(X:Y)$.

We present two applications of our result. The first one is the following easy-to-formulate combinatorial theorem: 
assume that the edges of a bipartite graph are partitioned into $K$ matchings such that for each pair (left vertex $x$,
right vertex $y$) there is at most one matching in the partition involving both $x$ and $y$; assume further that the degree
of each left vertex is at least $L$ and the degree of each right vertex is at least $R$. Then $K \ge LR$. The second
application is a new method to prove lower bounds for biclique coverings of bipartite graphs. 

Index Terms 

Shannon entropy, conditional information inequalities, non-Shannon-type information inequalities, biclique
covering 


I. Introduction 

The most general and fundamental properties of Shannon’s entropy can be expressed in the language of linear 
inequalities. The usual universal information inequalities (the linear inequalities that hold for Shannon’s entropies 
of jointly distributed tuples of random variables for every distribution) have many equivalent characterizations 
and interpretations in very different areas — these inequalities can be equivalently reformulated in the settings of 
Kolmogorov complexity and group theory; they give characterizations of the network coding capacity rates, of the 
cardinalities of projections of finite sets, etc.; see the surveys in [11], [15]. The parallelism and interplay between
different “incarnations” of information inequalities lead to their better understanding and to more efficient applications
of this technique. However, there exists a class of more exotic information inequalities that still lack a satisfactory 
explanation and have no clear combinatorial interpretation. These are the conditional linear information inequalities, 
which hold only for distributions that satisfy some constraints. The first nontrivial example of a conditional linear 
information inequality was proven in the seminal paper [2]; see a survey of other similar results in [12]. Until now,
these inequalities looked like artifacts without practical or theoretical application. In this paper, we argue that some 
conditional inequalities can be naturally interpreted in purely combinatorial terms. We propose a new “conditional 
information inequality,” discuss its combinatorial meaning, and show how it can be employed in pure combinatorial 
proofs. 

Let A, X, Y be jointly distributed discrete random variables. In this paper, we consider the inequality 

$$H(A|X) + H(A|Y) \le H(A), \tag{1}$$

where $H(\cdot)$ stands for Shannon’s entropy. For some $A, X, Y$ this inequality is false, e.g., for constant $X, Y$ and
non-constant $A$. We provide a natural condition on the distribution of $A, X, Y$ implying inequality (1). Then we
provide two combinatorial applications of the resulting conditional inequality and show that it implies the conditional
inequality from [10].
More specifically, we consider the following condition:

for each quadruple $a, a', x, y$, if the probabilities of all four events

$$[A = a, X = x],\ [A = a, Y = y],\ [A = a', X = x],\ [A = a', Y = y] \tag{2}$$

are positive, then $a = a'$.

Theorem 1. The inequality (1) holds for all random variables $A, X, Y$ satisfying (2).

We first prove this theorem and then show its combinatorial applications. 
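
As an informal illustration (not part of the original argument), condition (2) and inequality (1) can be checked numerically for a finite distribution. The Python sketch below assumes the joint distribution is given as a dictionary mapping triples (a, x, y) to probabilities; all function names are ours and purely illustrative.

```python
from collections import defaultdict
from math import log2

def marginal(p, idx):
    """Marginal of a joint pmf {(a, x, y): prob} on the coordinates listed in idx."""
    m = defaultdict(float)
    for outcome, pr in p.items():
        if pr > 0:
            m[tuple(outcome[i] for i in idx)] += pr
    return dict(m)

def entropy(m):
    """Shannon entropy (in bits) of a pmf given as a dictionary."""
    return -sum(pr * log2(pr) for pr in m.values() if pr > 0)

def cond_entropy(p, target, given):
    """H(target | given) = H(target, given) - H(given)."""
    return entropy(marginal(p, target + given)) - entropy(marginal(p, given))

def satisfies_condition_2(p):
    """True if every pair (x, y) is compatible with at most one value a."""
    ax = set(marginal(p, (0, 1)))  # pairs (a, x) with positive probability
    ay = set(marginal(p, (0, 2)))  # pairs (a, y) with positive probability
    xs = {x for _, x in ax}
    ys = {y for _, y in ay}
    for x in xs:
        for y in ys:
            with_x = {a for a, x2 in ax if x2 == x}
            with_y = {a for a, y2 in ay if y2 == y}
            if len(with_x & with_y) > 1:
                return False
    return True

def inequality_1_gap(p):
    """H(A) - H(A|X) - H(A|Y); nonnegative exactly when (1) holds."""
    h_a = entropy(marginal(p, (0,)))
    return h_a - cond_entropy(p, (0,), (1,)) - cond_entropy(p, (0,), (2,))

# A is a uniform bit while X and Y are constant: the counterexample from the text.
bad = {(0, 0, 0): 0.5, (1, 0, 0): 0.5}
# A is a uniform bit and X = Y = A.
good = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
```

In this sketch `bad` violates both condition (2) and inequality (1), whereas `good` satisfies both, in line with Theorem 1.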


II. Notation 

To simplify formulas, we use the following notation for the marginal distributions (conditional and unconditional): 
$p(a)$ denotes $\Pr[A = a]$, $p(a,x) = \Pr[A = a, X = x]$, $p(a|x) = \Pr[A = a \mid X = x]$, $p(a,y) = \Pr[A = a, Y = y]$,
and so on. 

If $X$ is a random variable and $\mathcal{E}$ is an event in the same probability space (with $\Pr[\mathcal{E}] > 0$), we denote by
$X|\mathcal{E}$ the conditional distribution of $X$, i.e., the restriction of $X$ to the subspace corresponding to the event $\mathcal{E}$. For
example, for jointly distributed random variables $(X, Y)$ we denote by $X|(Y = y)$ the conditional distribution of
$X$ under the assumption $Y = y$.


III. The proof of Theorem 1

We apply the method of [?], [?]. The crucial property of inequality (1) is that no term contains both $X$ and $Y$.
The inequality (1) can be rewritten in terms of unconditional entropies as follows:

$$H(A,X) + H(A,Y) \le H(X) + H(Y) + H(A).$$

This means that the average value of the logarithm of the ratio

$$\frac{p(x)\,p(y)\,p(a)}{p(a,x)\,p(a,y)} \tag{3}$$

is less than or equal to 0. The average is computed with respect to the distribution $p(a,x,y)$. Computing the
average, we take into account only the triples $a, x, y$ with positive probability. For such triples, both the numerator
and the denominator of the ratio (3) are positive and hence its logarithm is well defined.

Now consider a new distribution $p'$ where

$$p'(a,x,y) = \begin{cases} \dfrac{p(a,x)\,p(a,y)}{p(a)} & \text{if } p(a) > 0, \\ 0 & \text{otherwise.} \end{cases}$$


Random variables distributed according to $p'$ can be generated by the following process: first generate $a$ using
the original distribution of $A$, then independently generate $x$ using the conditional distribution $X|(A = a)$ and $y$ using the
conditional distribution $Y|(A = a)$.

Notice that $p'(a,x,y)$ is positive whenever $p(a,x,y)$ is, but not the other way around. However, the ratio (3) is still well
defined and positive for all triples $a,x,y$ with positive $p'(a,x,y)$. Therefore we can compute the average value
of the logarithm of (3) using the distribution $p'$ in place of $p$. Moreover, changing the distribution does not affect
the average. Indeed, the logarithm of (3) is the sum of the logarithms of its factors. Thus it suffices to show that the
average of the logarithm of each factor is not affected when we replace $p$ by $p'$. Let us prove this, say, for the
factor $1/p(a,x)$.

This factor does not depend on $y$. Therefore the average of its logarithm does not depend on how $p(a,x)$ is
split among $p(a,x,y)$ for different values $y$: we just sum up $\log(1/p(a,x))$ over all $a,x$ with weights $p(a,x)$. As
$p(a,x) = p'(a,x)$, summing with weights $p'(a,x)$ yields the same result.
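
The construction of $p'$ and the equality of the pairwise marginals can be illustrated with a short Python sketch. It assumes a joint distribution stored as a dictionary keyed by triples (a, x, y); the helper names are hypothetical. Note that this step does not rely on condition (2), so the example distribution is arbitrary.

```python
from collections import defaultdict

def pair_marginal(p, i, j):
    """Marginal of a pmf {(a, x, y): prob} on coordinates i and j."""
    m = defaultdict(float)
    for outcome, pr in p.items():
        m[(outcome[i], outcome[j])] += pr
    return dict(m)

def decoupled(p):
    """Build p'(a, x, y) = p(a, x) * p(a, y) / p(a) wherever p(a) > 0."""
    pa = defaultdict(float)
    for (a, _, _), pr in p.items():
        pa[a] += pr
    pax = pair_marginal(p, 0, 1)
    pay = pair_marginal(p, 0, 2)
    q = {}
    for (a, x), r in pax.items():
        for (a2, y), s in pay.items():
            if a == a2 and r > 0 and s > 0:
                q[(a, x, y)] = r * s / pa[a]
    return q

# An arbitrary small example distribution on triples (a, x, y).
p = {(0, 0, 0): 0.25, (0, 1, 1): 0.25, (1, 0, 1): 0.5}
q = decoupled(p)
```

Here `q` has a strictly larger support than `p` (five triples instead of three), while `pair_marginal(q, 0, 1)` and `pair_marginal(q, 0, 2)` coincide with the corresponding marginals of `p`.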

By Jensen’s inequality¹ the average value of the logarithm of the ratio (3) with respect to the distribution $p'$ is
at most

$$\log\Bigl(\sum_{a,x,y:\ p'(a,x,y)>0} p(x)\,p(y)\Bigr).$$

¹ We need Jensen’s inequality for the logarithmic function: let $p_1, \dots, p_n$ be positive numbers that sum up to 1; then
$p_1 \log x_1 + \dots + p_n \log x_n \le \log(p_1 x_1 + \dots + p_n x_n)$ for all positive $x_1, \dots, x_n$.
The condition (2) guarantees that for each $x, y$ there is at most one $a$ with $p(a,x) > 0$, $p(a,y) > 0$, and hence

$$\log\Bigl(\sum_{a,x,y:\ p'(a,x,y)>0} p(x)\,p(y)\Bigr) \le \log\Bigl(\sum_{x,y} p(x)\,p(y)\Bigr) = \log 1 = 0.$$
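
The last step of the proof can be traced numerically. The sketch below is an illustration under our own naming, with the distribution assumed to be a dictionary over triples (a, x, y). It evaluates the Jensen upper bound, i.e. the logarithm of the sum of $p(x)p(y)$ over the support of $p'$, using the fact that $p'(a,x,y) > 0$ exactly when $p(a,x) > 0$ and $p(a,y) > 0$.

```python
from collections import defaultdict
from math import log2

def jensen_bound(p):
    """log2 of the sum of p(x) * p(y) over triples (a, x, y) with p'(a, x, y) > 0."""
    px, py = defaultdict(float), defaultdict(float)
    ax, ay = set(), set()
    for (a, x, y), pr in p.items():
        if pr > 0:
            px[x] += pr
            py[y] += pr
            ax.add((a, x))
            ay.add((a, y))
    # A triple (a, x, y) lies in the support of p' iff (a, x) and (a, y) both occur.
    total = sum(px[x] * py[y] for (a, x) in ax for (a2, y) in ay if a == a2)
    return log2(total)

# Satisfies condition (2): each pair (x, y) is compatible with at most one a.
good = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
# Violates condition (2): the pair (0, 0) is compatible with both values of a.
bad = {(0, 0, 0): 0.5, (1, 0, 0): 0.5}
```

For `good` the bound is nonpositive, as the proof requires; for `bad` each pair (x, y) is counted once per compatible value of a, so the sum exceeds 1 and the argument breaks down.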

IV. Combinatorial applications of Theorem 1

A. A lower bound for the number of matchings in a bipartite graph
From Theorem 1 we can derive the following combinatorial statement:

Corollary 1. Assume that the edges of a bipartite graph are partitioned into $K$ matchings so that for each pair

(left vertex $x$, right vertex $y$)

there is at most one matching in the partition that involves both $x$ and $y$. Assume further that the degree of each
left vertex is at least $L$ and the degree of each right vertex is at least $R$. Then $K \ge LR$.

Proof: Let $M_1, \dots, M_K$ denote the given matchings. Consider the uniform distribution on the set of edges of
the graph. Denote by $(A, X, Y)$ the following triple of jointly distributed random variables:

X = [the left end of the edge],

Y = [the right end of the edge],

A = [the index $i$ of the matching $M_i$ containing the edge].

The conditions of the corollary imply that the triple $(A,X,Y)$ satisfies (2): if both events $[A = a, X = x]$ and
$[A = a, Y = y]$ have positive probabilities, then both $x$ and $y$ are involved in the matching $M_a$, and hence such a
matching $M_a$ is unique. Therefore by Theorem 1 we have $H(A|X) + H(A|Y) \le H(A)$.

Notice that for each matching $M_a$ the probability $\Pr[A = a]$ is proportional to the size of the matching, i.e., to
the number of edges (equivalently, to the number of vertices) involved in this matching. On the other hand, for each
vertex $x$ involved in a matching $M_a$, the conditional probability $\Pr[X = x \mid A = a]$ is inversely proportional to the size
of the matching (the left ends of all edges in the matching have the same probabilities). It follows that for every
fixed vertex $x$ all the matchings $M_a$ covering $x$ are equiprobable. In other words, conditional on $X = x$, the value
of $A$ is uniformly distributed on the set of all matchings that contain an edge incident to $x$. Thus, $H(A|X) \ge \log L$.
Similarly, we have $H(A|Y) \ge \log R$. Hence, $H(A) \ge \log L + \log R$. Since $A$ takes at most $K$ values, $H(A) \le \log K$, and therefore $K \ge LR$.
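
The hypotheses of Corollary 1 are easy to test mechanically on small graphs. The following Python sketch is only an illustration: it assumes the partition is given as a list of edge lists with pairwise disjoint classes, and the function name is hypothetical. It verifies that each class is a matching and that no pair (left vertex, right vertex) is involved in two classes, and it returns $K$ together with the minimal left and right degrees.

```python
def check_matching_partition(matchings):
    """matchings: list of lists of edges (x, y). Returns (K, L, R).

    Raises ValueError if a class is not a matching or if some pair
    (left vertex, right vertex) is involved in two different classes.
    Assumes the classes are pairwise disjoint, so the degree of a vertex
    equals the number of classes that contain it.
    """
    left_deg, right_deg = {}, {}
    seen_pairs = {}
    for index, m in enumerate(matchings):
        lefts = [x for x, _ in m]
        rights = [y for _, y in m]
        if len(set(lefts)) != len(lefts) or len(set(rights)) != len(rights):
            raise ValueError(f"class {index} is not a matching")
        for x in lefts:
            left_deg[x] = left_deg.get(x, 0) + 1
        for y in rights:
            right_deg[y] = right_deg.get(y, 0) + 1
        # Every pair (x, y) with x and y both touched by this class.
        for x in lefts:
            for y in rights:
                if (x, y) in seen_pairs:
                    raise ValueError(
                        f"pair {(x, y)} is involved in classes {seen_pairs[(x, y)]} and {index}"
                    )
                seen_pairs[(x, y)] = index
    return len(matchings), min(left_deg.values()), min(right_deg.values())

# The complete bipartite graph on 2 + 2 vertices, split into four one-edge matchings.
single_edges = [[("x0", "y0")], [("x0", "y1")], [("x1", "y0")], [("x1", "y1")]]
# The same graph split into two perfect matchings: the hypothesis fails.
two_perfect = [[("x0", "y0"), ("x1", "y1")], [("x0", "y1"), ("x1", "y0")]]

K, L, R = check_matching_partition(single_edges)
```

For `single_edges` the bound is tight ($K = 4 = LR$). For `two_perfect` the check raises an error, which is consistent with the corollary: there $K = 2 < LR = 4$, so the uniqueness hypothesis cannot be dropped.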


B. A lower bound for the biclique covering number of bipartite graphs 

Definition 1. For any bipartite graph $G = (V_1, V_2, E)$ (with the set of vertices $V_1 \cup V_2$ and a set of edges $E \subset V_1 \times V_2$)
its biclique covering number $\mathrm{bcc}(G)$ is defined as the minimal number of bicliques (complete bipartite subgraphs)
that cover all edges of $G$.

Biclique coverings play an important role in communication complexity. Specifically, the non-deterministic 
communication complexity (see [7]) of a predicate

$$P : U \times U \to \{0,1\}$$

can be defined as $\log \mathrm{bcc}(G)$ for the bipartite graph $G = (V_1, V_2, E)$, where $V_1 = V_2 = U$, and $E$ is the set of all
pairs $(x,y) \in U \times U$ such that $P(x,y) = 1$.

Corollary 2. Assume that the edges of a bipartite graph $G = (V_1, V_2, E)$ are colored in such a way that
(*) for every biclique $C$ and for every pair of edges $(x',y)$ and $(x,y')$ from $C$ of the same color $a$, the edge $(x,y)$ also
has color $a$.

Assume further that a probability distribution over the edges of the graph is given. Denote by $(X, Y, A)$ the random
variables where

• X = [the left end of the edge],

• Y = [the right end of the edge],

• A = [the color of the edge].

Then $\mathrm{bcc}(G) \ge 2^{\frac{1}{2}[H(A|X) + H(A|Y) - H(A)]}$.
Proof: Assume that this graph $G$ can be covered by $t$ bicliques $C_1, \dots, C_t$. Extend the distribution $(X, Y, A)$
and add another random variable: we define $Z$ as the index $i$ of a biclique $C_i$ that covers the edge $(X, Y)$. (If
an edge belongs to several bicliques $C_i$, then we choose any of them.) Notice that $Z$ ranges over $\{1, \dots, t\}$, so
$H(Z) \le \log t$.

The crucial point is that for a fixed value $i$ of $Z$ the condition (2) is satisfied. Indeed, assume that both
$p(a, x \mid Z = i)$ and $p(a, y \mid Z = i)$ are positive. Then the biclique $C_i$ has edges $(x,y')$ and $(x',y)$, both with color $a$. By property
(*) the color of the edge $(x,y)$ also equals $a$, and hence such $a$ is unique. By Theorem 1, for each conditional
distribution $(A, X, Y)|(Z = i)$ the inequality (1) holds. Hence we get

$$H(A|X,Z) + H(A|Y,Z) \le H(A|Z).$$

Since $H(A|X,Z) \ge H(A|X) - H(Z)$, $H(A|Y,Z) \ge H(A|Y) - H(Z)$, and $H(A|Z) \le H(A)$, it follows that

$$H(A|X) - H(Z) + H(A|Y) - H(Z) \le H(A).$$

Thus, we obtain $t \ge 2^{H(Z)} \ge 2^{\frac{1}{2}[H(A|X) + H(A|Y) - H(A)]}$. ■

Let us apply this corollary to a specific bipartite graph. Consider the bipartite graph $G_{n,k} = (V_1, V_2, E)$, where
both parts $V_1$ and $V_2$ consist of $k$-element subsets of $\{1, \dots, n\}$, and the set of edges $E \subset V_1 \times V_2$ consists of all
pairs of disjoint sets. Let us color the edge $(x,y)$ in color $x \cup y$ and consider the uniform probability distribution
over the edges of this graph. The condition (*) is fulfilled. Indeed, assume we are given three pairs of
disjoint $k$-element subsets: $(x,y)$, $(x,y')$ and $(x',y)$. Assume further that $x \cup y' = x' \cup y = a$. Then $x = x'$ and
$y = y'$ and hence $x \cup y = a$ as well. Hence

$$\mathrm{bcc}(G_{n,k}) \ge 2^{\frac{1}{2}[H(A|X) + H(A|Y) - H(A)]}.$$

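We have $\binom{n}{2k}$ equiprobable colors and hence $H(A) = \log_2 \binom{n}{2k}$. On the other hand, $H(A|X) = H(A|Y) =
\log_2 \binom{n-k}{k}$. Thus

$$\mathrm{bcc}(G_{n,k}) \ge \binom{n-k}{k} \Big/ \sqrt{\binom{n}{2k}}.$$

If $n \gg k$ then $\binom{n-k}{k} \big/ \sqrt{\binom{n}{2k}}$ is close to $\sqrt{\binom{2k}{k}} \approx 2^k$ and we obtain a lower bound about $2^k$ for $\mathrm{bcc}(G_{n,k})$.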
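
The entropies in this example can also be computed by brute force for small parameters. The Python sketch below is illustrative only, with names of our choosing. It enumerates the edges of $G_{n,k}$ under the uniform distribution, colors each edge $(x,y)$ by $x \cup y$, and evaluates $2^{\frac{1}{2}[H(A|X) + H(A|Y) - H(A)]}$, which can then be compared with the closed form $\binom{n-k}{k} / \sqrt{\binom{n}{2k}}$.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2, sqrt

def disjointness_bound(n, k):
    """Evaluate 2 ** (0.5 * (H(A|X) + H(A|Y) - H(A))) by enumerating the edges of G_{n,k}."""
    subsets = [frozenset(c) for c in combinations(range(n), k)]
    # Edges are the pairs of disjoint k-element subsets.
    edges = [(x, y) for x in subsets for y in subsets if not (x & y)]

    def H(values):
        # Entropy (in bits) of the empirical distribution of values over the edges.
        counts = Counter(values)
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())

    h_a = H(x | y for x, y in edges)           # color A = x union y
    h_ax = H((x | y, x) for x, y in edges)
    h_ay = H((x | y, y) for x, y in edges)
    h_x = H(x for x, _ in edges)
    h_y = H(y for _, y in edges)
    return 2 ** (0.5 * ((h_ax - h_x) + (h_ay - h_y) - h_a))

def closed_form(n, k):
    """The same bound written with binomial coefficients."""
    return comb(n - k, k) / sqrt(comb(n, 2 * k))
```

For instance, `disjointness_bound(6, 2)` and `closed_form(6, 2)` agree up to floating-point error. The enumeration is quadratic in the number of $k$-subsets, so this is meant for small $n$ and $k$ only.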

This bound in itself is of no interest; the simple and standard fooling set technique (see [?]) proves for this
graph the bound $\mathrm{bcc}(G) \ge \binom{2k}{k}$ that holds for all $n \ge 2k$. However, this simple example illustrates the connection
between biclique covering and conditional information inequalities. It remains unknown whether a similar technique 
can surpass the fooling set method for other examples of graphs. 

V. A generalization of the conditional inequality from [10]

The so-called Ingleton inequality was originally formulated and proven for ranks of linear subspaces [?].
It turns out that a counterpart of this inequality reformulated in terms of Shannon’s entropy (for random variables)
has many nontrivial applications. Though in general this inequality is not valid for entropies (see [?]), it holds for
distributions that satisfy some special properties (e.g., for random variables that enjoy the property of extracting the
mutual information, or for variables with some properties of independence; see, e.g., [?], [?], [?], [?]). In particular,
in [10] it was shown that Ingleton’s inequality for entropies holds for all distributions where the entropies satisfy
some linear constraints: 

Theorem 2 ([10]). If random variables $X, Y, A, B$ satisfy the constraints

$$I(X:Y|A) = H(A|X,Y) = 0, \tag{4}$$

then Ingleton’s inequality

$$I(A:B) \le I(A:B|X) + I(A:B|Y) + I(X:Y) \tag{5}$$

holds for this distribution.
A noteworthy fact is that this result cannot be obtained as a direct implication of any unconditional linear
inequality for Shannon’s entropy. More precisely, whatever pair of reals $\lambda_1, \lambda_2$ we take, the inequality

$$I(A:B) \le I(A:B|X) + I(A:B|Y) + I(X:Y) + \lambda_1 I(X:Y|A) + \lambda_2 H(A|X,Y)$$

does not hold for some distributions, see [?].

Ingleton’s inequality (5) can be equivalently rewritten as

$$H(A|X,B) + H(A|Y,B) \le H(A|B) + I(X:Y|A) + H(A|X,Y).$$
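
The equivalence of the two forms is a linear identity between entropies, so it can be sanity-checked numerically. The Python sketch below uses our own helper names and a randomly generated distribution on four binary variables (not an example from the paper). It compares the slack of Ingleton's inequality (5) with the slack of the rewritten inequality; the two should coincide for every distribution.

```python
import itertools
import random
from math import log2

A, B, X, Y = 0, 1, 2, 3  # coordinate positions inside each outcome tuple

def H(p, coords):
    """Joint entropy (in bits) of the coordinates listed in coords."""
    m = {}
    for outcome, pr in p.items():
        key = tuple(outcome[i] for i in coords)
        m[key] = m.get(key, 0.0) + pr
    return -sum(v * log2(v) for v in m.values() if v > 0)

def cond_H(p, target, given):
    """H(target | given)."""
    return H(p, target + given) - H(p, given)

def mutual_info(p, u, v, given=()):
    """I(u : v | given)."""
    return H(p, u + given) + H(p, v + given) - H(p, u + v + given) - H(p, given)

def ingleton_slack(p):
    """Right-hand side minus left-hand side of Ingleton's inequality."""
    return (mutual_info(p, (A,), (B,), (X,)) + mutual_info(p, (A,), (B,), (Y,))
            + mutual_info(p, (X,), (Y,)) - mutual_info(p, (A,), (B,)))

def rewritten_slack(p):
    """Right-hand side minus left-hand side of the rewritten inequality."""
    return (cond_H(p, (A,), (B,)) + mutual_info(p, (X,), (Y,), (A,))
            + cond_H(p, (A,), (X, Y))
            - cond_H(p, (A,), (X, B)) - cond_H(p, (A,), (Y, B)))

# A random distribution on four binary variables, with a fixed seed for reproducibility.
rng = random.Random(0)
weights = [rng.random() for _ in range(16)]
total = sum(weights)
p = {o: w / total for o, w in zip(itertools.product((0, 1), repeat=4), weights)}
```

Agreement of `ingleton_slack(p)` and `rewritten_slack(p)` on random inputs is of course not a proof; the identity itself follows by expanding every term into joint entropies.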

Under the constraints (4), this inequality is equivalent to

$$H(A|X,B) + H(A|Y,B) \le H(A|B). \tag{6}$$

Therefore Theorem 2 can be reformulated as follows:

$$I(X:Y|A) = H(A|X,Y) = 0 \ \Longrightarrow\ H(A|X,B) + H(A|Y,B) \le H(A|B). \tag{7}$$

Notice that the right-hand side of (7) (i.e., the inequality (6)) is a relativized version of (1) (the word relativization
here means that we insert a new condition in all entropy expressions). Because of this similarity we can deduce
Theorem 2 from our Theorem 1. To show this, we need to explain two implications:

(i) the condition of (7) (i.e., the constraint (4)) implies the condition (2),

(ii) the condition (2) implies (due to Theorem 1) the inequality (6) (i.e., the conclusion of (7)).

First we prove (i).

Lemma 1. If a tuple of random variables $(A, X, Y)$ satisfies (4), then it also satisfies (2).

Proof: The equality $I(X:Y|A) = 0$ means that

$$p(a,x,y)\,p(a) = p(a,x)\,p(a,y)$$

for all triples $a, x, y$. Thus it implies that for each triple $a, x, y$ of values of $A, X, Y$, if both probabilities $p(a,x)$
and $p(a,y)$ are positive, then $p(a,x,y)$ is also positive. Hence, if for some $a, a', x, y$ all four probabilities

$$p(a,x),\ p(a,y),\ p(a',x),\ p(a',y)$$

are positive (the assumption of (2)), then it follows that the probabilities $p(a,x,y)$ and $p(a',x,y)$ must also be
positive.

Now we employ the condition $H(A|X,Y) = 0$ (which means that the value of $A$ is a deterministic function of
$(X, Y)$). If both probabilities $p(a,x,y)$ and $p(a',x,y)$ are positive, then $a = a'$, and we obtain the conclusion of
(2). ■

Now we proceed to (ii) and explain how to deduce (6) from (2). The key observation is that the condition (2)
is “relativizable”: the property (2) remains true if we restrict the initial probability space to some subspace.

Lemma 2. If a tuple of random variables $(A,X,Y)$ satisfies (2), then for each event $\mathcal{E}$ having positive probability
the conditional random variables $(A,X,Y)|\mathcal{E}$ satisfy (2).

Proof: Assume that the four probabilities

$$\Pr[X = x, A = a \mid \mathcal{E}],\ \Pr[Y = y, A = a \mid \mathcal{E}],\ \Pr[X = x, A = a' \mid \mathcal{E}],\ \Pr[Y = y, A = a' \mid \mathcal{E}]$$

are positive. Then the unconditional probability of each of these events is positive as well and hence $a = a'$ by (2). ■

Now we see that (6) follows from the condition (2). Indeed, for every possible value $b$ of $B$, Lemma 2 guarantees
that (2) remains valid conditional on the event $B = b$. By Theorem 1 this implies

$$H(A|X, B = b) + H(A|Y, B = b) \le H(A|B = b),$$

and taking the average over all values $b$ we get (6). This claim can be formulated as the following corollary of
Theorem 1 (generalizing Theorem 2 in the form (7)).

Corollary 3. The inequality (6) holds for all random variables $A, B, X, Y$ satisfying (2).